Uncertainty drilldown

We have explored various ways to reduce the uncertainty in the emissions data. Now we look at the question: if we want to obtain more precise estimates, which sector should we focus on?

This section presents a simple framework to get started. You will see that the main suspects are not always the ones you would expect.

For now, let's assume that analyzing any sector in any country is equally easy. In practice, of course, the choice of analysis is driven by the data available. Nevertheless, it is a good baseline to understand how hard it would be to reduce this uncertainty.

The source emissions dataset contains a qualitative confidence level for each source, based on expert assessment of the data. The levels are very low, low, medium, high and very high. In our simple baseline, we assume that each of these confidence levels corresponds to a numerical value of uncertainty. Reality is more complicated: humans are not great at assessing uncertainty, and when the estimates are wrong, they can be very wrong, sometimes by multiple orders of magnitude. Nevertheless, this is a baseline.

%load_ext autoreload
%autoreload 2
import polars as pl
import plotly.express as px
import scipy as sp

from ctrace.constants import *
import ctrace as ct

Let's focus on the year 2023 and on CO2e_100year.

year = 2023
gas = CO2E_100YR
gas_conf = "conf_emissions_quantity"
c_gas_conf = C(gas_conf)
sdf_gy = ct.read_source_emissions(gas=gas,year=year)

A confidence level is provided for each source. To reduce the workload, we first aggregate by confidence level, country, sector and subsector. This is all we need for this analysis, and it makes the analysis significantly faster.

Note

Technical: all the aggregation keys are enumerations or categories. This makes all the aggregations very fast because Polars knows precisely how many keys will be aggregated and leverages these statistics.

df = (sdf_gy
 .group_by([c_iso3_country, c_subsector, c_gas_conf, c_sector])
 .agg(pl.len().alias("count"), c_emissions_quantity.sum())
 .sort(by=c_emissions_quantity)
 .collect()
)
df
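shape: (5_921, 6)
| iso3_country | subsector                | conf_emissions_quantity | sector                   | count | emissions_quantity |
|--------------|--------------------------|-------------------------|--------------------------|-------|--------------------|
| enum         | enum                     | enum                    | enum                     | u32   | f64                |
| "CAN"        | "net-forest-land"        | "high"                  | "forestry-and-land-use"  | 2520  | -4.9558e8          |
| "GBR"        | "net-shrubgrass"         | "high"                  | "forestry-and-land-use"  | 1272  | -7.2577e7          |
| "IRL"        | "net-shrubgrass"         | "high"                  | "forestry-and-land-use"  | 1164  | -6.6705e7          |
| "RUS"        | "net-shrubgrass"         | "medium"                | "forestry-and-land-use"  | 10020 | -5.9092e7          |
| "RUS"        | "net-wetland"            | "high"                  | "forestry-and-land-use"  | 15156 | -5.5066e7          |
| …            | …                        | …                       | …                        | …     | …                  |
| "IND"        | "electricity-generation" | "medium"                | "power"                  | 5928  | 1.2810e9           |
| "CHN"        | "coal-mining"            | "very low"              | "fossil-fuel-operations" | 21516 | 1.3067e9           |
| "USA"        | "electricity-generation" | "medium"                | "power"                  | 25325 | 1.4151e9           |
| "USA"        | "road-transportation"    | "low"                   | "transportation"         | 37776 | 1.4554e9           |
| "CHN"        | "electricity-generation" | "medium"                | "power"                  | 16990 | 4.9841e9           |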

We use a very simple mapping from qualitative confidence assessments to quantitative error bounds. We assume that very high confidence corresponds to a standard deviation of around 1% of the emissions and very low confidence to around 30-50%, with a roughly geometric progression in between.

By default, if no confidence is provided, we will assume it is very low. We will err on the side of caution.
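To make the idea of a geometric progression concrete, here is a minimal sketch that spreads five levels evenly (in ratio) between 1% and 30%. The helper `geometric_margins` and the variable `alt_margins` are hypothetical names used only for this illustration. The values used in the notebook are hand-picked round numbers rather than the output of this formula.

```python
def geometric_margins(levels, lowest, highest):
    """Spread relative margins geometrically from `lowest` to `highest`."""
    # Constant ratio between consecutive levels.
    ratio = (highest / lowest) ** (1 / (len(levels) - 1))
    return {level: lowest * ratio**i for i, level in enumerate(levels)}


# Ordered from most to least confident.
LEVELS = ["very high", "high", "medium", "low", "very low"]
alt_margins = geometric_margins(LEVELS, lowest=0.01, highest=0.30)

for level, margin in alt_margins.items():
    print(f"{level:>9}: {margin:.3f}")
```

With these endpoints, the intermediate levels come out at roughly 2.3%, 5.5% and 12.8%, which is close to the rounded 3%, 7% and 15% of the simple mapping.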

Note

Try different numbers. The results of this analysis are rather stable under different choices.

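# Relative standard deviation assumed for each confidence level.
margins = {"very high": 0.01, "high": 0.03, "medium": 0.07, "low": 0.15, "very low": 0.3}
ERR_MARGIN = "err_margin"

# Absolute error margin = relative margin * emissions.
# Levels missing from the mapping fall back to "very low".
df = df.with_columns(
    (
        c_gas_conf.replace_strict(margins, return_dtype=pl.Float32, default=margins["very low"])
        * c_emissions_quantity
    ).alias(ERR_MARGIN)
)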
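One caveat worth checking: because the margin is the relative value multiplied by the emissions, a sink with negative emissions gets a negative error margin, which offsets positive margins when everything is summed. The plain-Python sketch below illustrates this and shows a hypothetical variant based on absolute values. The names `err_margin` and `use_abs` are illustrative and not part of ctrace; in Polars the same variant would presumably rely on the expression method `.abs()`.

```python
MARGINS = {"very high": 0.01, "high": 0.03, "medium": 0.07, "low": 0.15, "very low": 0.3}


def err_margin(emissions, confidence, use_abs=False):
    """Absolute error margin for one row; unknown levels fall back to 'very low'."""
    relative = MARGINS.get(confidence, MARGINS["very low"])
    base = abs(emissions) if use_abs else emissions
    return relative * base


# Canadian net forest land from the aggregated table: a sink, so emissions are negative.
signed = err_margin(-4.9558e8, "high")
unsigned = err_margin(-4.9558e8, "high", use_abs=True)
print(signed, unsigned)
```

Whether the signed or the absolute version is the right one depends on how the margins are meant to be combined, so this is something to decide explicitly rather than by default.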

What is the total error? It is around 16% of total emissions, which is reasonable (the IPCC reports provide a 10% standard deviation).

df.select(c_emissions_quantity.sum(), C(ERR_MARGIN).sum(), C("count").sum())
shape: (1, 3)
| emissions_quantity | err_margin | count   |
|--------------------|------------|---------|
| f64                | f64        | u32     |
| 3.6539e10          | 5.9866e9   | 6605664 |
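The relative error follows directly from the two totals in this table. Note that summing the margins linearly amounts to assuming that all errors are perfectly correlated; if they were independent, they would combine in quadrature and the total would be smaller. The sketch below shows both on two made-up margins. The helper names and the toy values are assumptions for illustration only.

```python
import math

# Totals taken from the table above.
total_emissions = 3.6539e10
total_margin = 5.9866e9
relative_error = total_margin / total_emissions
print(f"relative error: {relative_error:.1%}")


def linear_sum(margins):
    """Worst case: errors are fully correlated and add up directly."""
    return sum(margins)


def quadrature_sum(margins):
    """Independent errors: standard deviations add in quadrature."""
    return math.sqrt(sum(m * m for m in margins))


# Two hypothetical error margins, chosen only to make the arithmetic easy.
toy_margins = [4.0e8, 3.0e8]
print(linear_sum(toy_margins), quadrature_sum(toy_margins))
```

The linear sum used in this notebook is therefore a conservative upper bound, which fits the choice to err on the side of caution.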

Let us look at the error margins based on this scheme.

First, notice the switch in positions: electricity generation, the largest source of emissions, ranks only third in terms of uncertainty. This makes sense: there are only a few thousand power plants around the world, many of which are in highly regulated countries and are already monitored for other pollutants such as NOx.

Second, transportation: there are many cars on the road, each with different characteristics. It is very hard to estimate how much they emit.

Now, for the surprise: coal mining. This is the opposite of the canary in the coal mine. Satellites have found that coal mines leak potent gases such as methane through the ground, which was underestimated in the original inventory assessments. This is an example of a nasty surprise in climate accounting, as there is not much we can do to prevent a mine from venting through the ground.

A nearly missing element: the forestry sector. Deforestation is a huge source of emissions through fires, but this does not appear here. For V3, Climate TRACE focuses on emissions that can definitely be linked to human activities, and there is already a lot to be said about those.

The agriculture sector is also a big contributor, but interestingly rice cultivation does not seem to be much of an issue. Cropland fires and enteric fermentation (cow burps) are much more uncertain.

sec_rankings = (
    df
    .group_by(c_sector, c_subsector, c_gas_conf)
    .agg(c_emissions_quantity.sum(), C(ERR_MARGIN).sum())
    .sort(by=ERR_MARGIN, descending=True)
)
# Build a nice order for the graph: sectors sorted by total error margin
_order = (
    sec_rankings
    .group_by(c_sector)
    .agg(C(ERR_MARGIN).sum())
    .sort(by=ERR_MARGIN, descending=True)[SECTOR]
    .to_list()
)

px.bar(
    sec_rankings,
    x=SECTOR,
    y=ERR_MARGIN,
    color=gas_conf,
    category_orders={SECTOR: _order},
    log_y=False,
    hover_name=SUBSECTOR,
    color_discrete_map={'very low': 'black', 'low': 'darkblue', 'medium': 'royalblue', 'high': 'lightcyan'},
)

What could we focus on? Here is a list in order of uncertainty. A better understanding of the impact of fires is going to make a significant difference.

Focusing on Chinese electricity generation is also very important. Even though its relative margin of error is low, this sector is so large that any improvement will have a disproportionate impact.

rankings = (df.group_by(c_iso3_country, c_subsector, c_sector)
 .agg(c_emissions_quantity.sum(), C(ERR_MARGIN).sum())
 .sort(by=ERR_MARGIN, descending=True)
 .with_row_index()
)
rankings.head(10)
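shape: (10, 6)
| index | iso3_country | subsector                | sector                   | emissions_quantity | err_margin |
|-------|--------------|--------------------------|--------------------------|--------------------|------------|
| u32   | enum         | enum                     | enum                     | f64                | f64        |
| 0     | "USA"        | "road-transportation"    | "transportation"         | 2.0743e9           | 4.0398e8   |
| 1     | "CHN"        | "coal-mining"            | "fossil-fuel-operations" | 1.3067e9           | 3.9200e8   |
| 2     | "CHN"        | "electricity-generation" | "power"                  | 5.0054e9           | 3.5528e8   |
| 3     | "CHN"        | "road-transportation"    | "transportation"         | 9.2638e8           | 1.8151e8   |
| 4     | "CHN"        | "cement"                 | "manufacturing"          | 7.8082e8           | 1.6110e8   |
| 5     | "USA"        | "electricity-generation" | "power"                  | 1.4728e9           | 1.1639e8   |
| 6     | "IND"        | "electricity-generation" | "power"                  | 1.2841e9           | 9.0614e7   |
| 7     | "CHN"        | "chemicals"              | "manufacturing"          | 2.8828e8           | 8.1751e7   |
| 8     | "JPN"        | "road-transportation"    | "transportation"         | 4.0113e8           | 8.0934e7   |
| 9     | "CHN"        | "cropland-fires"         | "agriculture"            | 2.6813e8           | 8.0439e7   |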
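To put a number on the point about very large sectors, one can ask how much absolute error margin disappears when a group of sources moves up one confidence level. This is a back-of-the-envelope sketch: `margin_reduction` is a hypothetical helper, the margins are the baseline mapping of this notebook, and the emissions are the medium-confidence Chinese electricity generation and low-confidence US road transportation rows from the first aggregated table.

```python
MARGINS = {"very high": 0.01, "high": 0.03, "medium": 0.07, "low": 0.15, "very low": 0.3}


def margin_reduction(emissions, current, target):
    """Absolute error margin removed by moving from `current` to `target` confidence."""
    return abs(emissions) * (MARGINS[current] - MARGINS[target])


# Chinese electricity generation: medium -> high confidence.
chn_power = margin_reduction(4.9841e9, "medium", "high")
# US road transportation: low -> medium confidence.
usa_road = margin_reduction(1.4554e9, "low", "medium")

print(f"CHN power, medium -> high: {chn_power:.3e}")
print(f"USA road,  low -> medium:  {usa_road:.3e}")
```

Under these assumptions, a one-level improvement on Chinese power removes more absolute uncertainty than a one-level improvement on US road transportation, even though the latter starts from a worse confidence level.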

Grouping by country, we can discuss which sectors are the most uncertain.

In the US, road transportation is a surprising source of uncertainty that should be relatively easy to address.

# Exclude forestry and land use from the plot
_plot_data = rankings.filter(~(c_sector == FORESTRY_AND_LAND_USE))
# Countries sorted by total error margin
_country_order = (
    _plot_data
    .group_by(c_iso3_country)
    .agg(C(ERR_MARGIN).sum())
    .sort(by=ERR_MARGIN, descending=True)[ISO3_COUNTRY]
    .to_list()
)
px.bar(
    _plot_data.filter(c_iso3_country.is_in(_country_order[:10])),
    x=ISO3_COUNTRY,
    y=ERR_MARGIN,
    color=SECTOR,
    hover_name=SUBSECTOR,
    category_orders={ISO3_COUNTRY: _country_order},
    log_y=False,
)

Conclusion

We saw in this notebook how to define levels of uncertainty from qualitative assessments. This gives us areas of focus: electricity generation in China, transportation in the US, …

It also underlines, once again, the complexity of forestry and land use. Vegetation is both the largest carbon sink at scale and the least understood. There is much to learn in that area.